To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just a single exemplar. We hereby study a new T2V generation problem, One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt a T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally coherent videos across various applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
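The abstract above names a Sparse-Causal Attention but does not spell out its attention pattern. The following is a minimal sketch under a stated assumption: the queries of each frame attend only to the tokens of the first frame and of the immediately preceding frame. The function name `sparse_causal_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the single-head NumPy formulation are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_causal_attention(frames, w_q, w_k, w_v):
    """Single-head attention over a (T, N, D) stack of frame tokens.

    Assumption: frame i builds its keys and values from frame 0 and
    frame i-1 only, so no frame ever looks at a later frame.
    """
    outputs = []
    for i, frame in enumerate(frames):
        prev = frames[max(i - 1, 0)]                          # frame 0 reuses itself
        context = np.concatenate([frames[0], prev], axis=0)   # (2N, D)
        q = frame @ w_q                                       # (N, D)
        k = context @ w_k                                     # (2N, D)
        v = context @ w_v                                     # (2N, D)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (N, 2N)
        outputs.append(attn @ v)
    return np.stack(outputs)                                  # (T, N, D)
```

Because each frame sees a fixed number of context tokens (2N) rather than all T·N tokens, the cost grows linearly with the number of frames in this sketch, while the restriction to earlier frames keeps the pattern causal.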
Translated by Google Translate
Our education system comprises a series of curricula. For example, when we learn mathematics at school, we learn in order from addition, to multiplication, and later to integration. Delineating a curriculum for teaching either a human or a machine shares the underlying goal of maximizing the positive knowledge transfer from early to later tasks and minimizing forgetting of the early tasks. Here, we exhaustively surveyed the effect of curricula on existing continual learning algorithms in the class-incremental setting, where algorithms must learn classes one at a time from a continuous stream of data. We observed that across a breadth of possible class orders (curricula), curricula influence the retention of information and that this effect is not just a product of stochasticity. Further, as a primary effort toward automated curriculum design, we proposed a method capable of designing and ranking effective curricula based on inter-class feature similarities. We compared the predicted curricula against empirically determined effectual curricula and observed significant overlaps between the two. To support the study of a curriculum designer, we conducted a series of human psychophysics experiments and contributed a new Continual Learning benchmark in object recognition. We assessed the degree of agreement in effective curricula between humans and machines. Surprisingly, our curriculum designer successfully predicts an optimal set of curricula that is effective for human learning. There are many considerations in curriculum design, such as timely student feedback and learning with multiple modalities. Our study is the first attempt to set a standard framework for the community to tackle the problem of teaching humans and machines to learn to learn continuously.
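VQA is an ambitious task that aims to answer any image-related question. In practice, however, it is hard to build such a system once and for all, because users' needs are continuously updated and the system has to implement new functions. Continual Learning (CL) ability is therefore a must for developing advanced VQA systems. Recently, a pioneering work split a VQA dataset into disjoint answer sets to study this topic. However, CL on VQA involves more than the expansion of label sets (new answer sets). It is crucial to study how to answer questions when deploying VQA systems to new environments (new visual scenes) and how to answer questions that require new functions (new question types). We therefore propose CLOVE, a benchmark for continual learning on visual question answering, which contains scene-incremental and function-incremental settings for the two aforementioned CL scenarios. In terms of methodology, the main difference between CL on VQA and CL on classification is that the former additionally involves expanding reasoning mechanisms and preventing them from being forgotten, while the latter focuses on class representation. We therefore propose a real-data-free, replay-based method tailored for CL on VQA, named Scene Graph as Prompt for Symbolic Replay. Using a piece of scene graph as a prompt, it replays pseudo scene graphs that represent past images, along with the correlated QA pairs. A unified VQA model is also proposed to utilize the current and replayed data to enhance its QA ability. Finally, experimental results reveal the challenges in CLOVE and demonstrate the effectiveness of our method. The dataset and code will be available at https://github.com/showlab/clvqa.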
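Imbalanced training data is a significant challenge for medical image classification. In this study, we propose a novel Progressive Class-Center Triplet (PCCT) framework to alleviate the class imbalance issue, particularly for the diagnosis of rare diseases, mainly by carefully designing the triplet sampling strategy and the triplet loss formation. Specifically, the PCCT framework includes two successive stages. In the first stage, PCCT trains the diagnosis system via a class-balanced triplet loss to coarsely separate the distributions of different classes. In the second stage, the PCCT framework further improves the diagnosis system via a class-center involved triplet loss, which leads to a more compact distribution for each class. For the class-balanced triplet loss, triplets are sampled equally for each class at each training iteration, which alleviates the imbalanced data issue. For the class-center involved triplet loss, the positive and negative samples in each triplet are replaced by their corresponding class centers, which forces data representations of the same class to lie closer to the class center. Furthermore, the class-center involved triplet loss is extended to the pairwise ranking loss and the quadruplet loss, which demonstrates the generality of the proposed framework. Extensive experiments support that the PCCT framework works effectively for medical image classification with imbalanced training images. On two skin image datasets and one chest X-ray dataset, the proposed approach obtains mean F1 scores of 86.2, 65.2, and 90.66 over all classes and 81.4, 63.87, and 81.92 for rare classes, respectively, achieving state-of-the-art performance and outperforming widely used methods for the class imbalance issue.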
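The class-center involved triplet loss described above can be illustrated with a small sketch. It follows the abstract in replacing the positive and negative samples of each triplet with class centers; everything else is an assumption made for illustration: the function names, the use of the batch mean as the class center, the Euclidean distance, the margin value, and the choice of the nearest other-class center as the negative.

```python
import numpy as np

def class_centers(features, labels):
    """Mean feature vector of each class (assumed definition of a class center)."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def class_center_triplet_loss(features, labels, margin=1.0):
    """Hinge-style triplet loss with class centers as positive and negative.

    Anchor: a sample. Positive: the center of its own class.
    Negative: the nearest center of any other class (an assumed choice).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    centers = class_centers(features, labels)
    losses = []
    for x, y in zip(features, labels):
        d_pos = np.linalg.norm(x - centers[y])
        d_neg = min(np.linalg.norm(x - c) for k, c in centers.items() if k != y)
        losses.append(max(0.0, d_pos - d_neg + margin))
    return float(np.mean(losses))
```

The loss is zero once every sample is closer to its own class center than to any other center by at least `margin`, which matches the stated goal of a more compact distribution per class.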
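Cognitive science has shown that humans perceive videos in terms of events separated by the state changes of dominant subjects. State changes trigger new events and are among the most useful pieces of information in the large amount of redundant information perceived. However, previous research focuses on the overall understanding of segments without evaluating the fine-grained state changes inside them. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of 170K boundaries associated with captions that describe state changes in generic events across 12K videos. On this new dataset, we propose three tasks that support the development of a more fine-grained, robust, and human-like understanding of videos through state changes. We evaluate many representative baselines on our dataset, for which we also design a new TPD (Temporal-based Pairwise Difference) modeling method for visual difference and achieve significant performance improvements. In addition, the results show that current methods still face formidable challenges in utilizing different granularities, representing visual difference, and accurately localizing state changes. Further analysis shows that our dataset can drive the development of more powerful methods for understanding state changes and thus improve video-level comprehension. The dataset is available at https://github.com/yuxuan-w/geb-plus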
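A long-standing goal of intelligent assistants such as AR glasses and robots has been to assist users in affordance-centric real-world scenarios, such as "How can I run the microwave for 1 minute?". However, there is still no clear task definition or suitable benchmark. In this paper, we define a new task called Affordance-centric Question-driven Task Completion, in which the AI assistant should learn from instructional videos and scripts to guide the user step by step. To support the task, we construct AssistQ, a new dataset comprising 531 question-answer samples derived from 100 newly filmed first-person videos. Each question should be completed with multi-step guidance by inferring from visual details (e.g., the position of buttons) and textual details (e.g., actions such as press or turn). To address this unique task, we develop a Question-to-Actions (Q2A) model that significantly outperforms several baseline methods while still leaving large room for improvement. We expect our task and dataset to advance the development of egocentric AI assistants. Our project page is available at: https://showlab.github.io/assistq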
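It is still a pipe dream that AI assistants on phones and AR glasses can help us in our daily lives by answering questions such as "How do I adjust the date on this watch?" and "How do I set its heating duration?" (asked while pointing at an oven). The queries used in conventional tasks (i.e., video question answering, video retrieval, moment localization) are often factoid and based on pure text. In contrast, we present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR). Each of our questions is an image-box-text query that focuses on the affordance of items in our daily lives and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of this AQVSR task, we construct a new dataset called AssistSR. We design novel guidelines to create high-quality samples. This dataset contains 1.4K multimodal questions on 1K video segments from instructional videos on diverse daily-use items. To address AQVSR, we develop a simple yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still leaving large room for future improvement. Moreover, we present detailed ablation analyses. Our code and data are available at https://github.com/stanlei52/aqvsr.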
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks. However, recent research generally focuses on short-distance applications (e.g., phone unlocking) while lacking consideration of long-distance scenes (e.g., surveillance security checks). In order to promote relevant research and fill this gap in the community, we collect a large-scale Surveillance High-Fidelity Mask (SuHiFiMask) dataset captured under 40 surveillance scenes, which has 101 subjects from different age groups with 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. In these scenes, low image resolution and noise interference are new challenges faced in surveillance FAS. Together with the SuHiFiMask dataset, we propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by image quality from three aspects: (1) An Image Quality Variable module (IQV) is introduced to recover image information associated with discrimination by incorporating a super-resolution network. (2) Generated sample pairs are used to simulate quality variance distributions, helping contrastive learning strategies obtain robust feature representations under quality variation. (3) A Separate Quality Network (SQN) is designed to learn discriminative features independent of image quality. Finally, extensive experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
Embedding words in vector space is a fundamental first step in state-of-the-art natural language processing (NLP). Typical NLP solutions employ pre-defined vector representations to improve generalization by co-locating similar words in vector space. For instance, Word2Vec is a self-supervised predictive model that captures the context of words using a neural network. Similarly, GloVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics. Such word embeddings have significantly boosted important NLP tasks, including sentiment analysis, document classification, and machine translation. However, the embeddings are dense floating-point vectors, making them expensive to compute and difficult to interpret. In this paper, we instead propose to represent the semantics of words with a few defining words that are related using propositional logic. To produce such logical embeddings, we introduce a Tsetlin Machine-based autoencoder that learns logical clauses in a self-supervised manner. The clauses consist of contextual words like "black," "cup," and "hot" to define other words like "coffee," thus being human-understandable. We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks. Furthermore, we investigate the interpretability of our embedding using the logical representations acquired during training. We also visualize word clusters in vector space, demonstrating how our logical embedding co-locates similar words.
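To make the idea of defining a word through a logical clause concrete, here is a minimal sketch of evaluating one conjunctive clause against a bag of context words. The dictionary layout, the name `clause_fires`, and the example clause are hypothetical illustrations built from the "coffee" example in the abstract; they do not reproduce the Tsetlin Machine autoencoder or its training.

```python
def clause_fires(clause, context):
    """Evaluate a conjunctive clause over a set of context words.

    The clause is true when every word in clause["include"] is present
    and no word in clause["exclude"] (negated literals) is present.
    """
    return clause["include"] <= context and not (clause["exclude"] & context)

# Hypothetical clause for the target word "coffee": black AND cup AND hot.
coffee_clause = {"include": {"black", "cup", "hot"}, "exclude": set()}

sentence = {"a", "hot", "black", "drink", "in", "a", "cup"}
print(clause_fires(coffee_clause, sentence))
```

Because a clause is just a set of required and forbidden words, a reader can inspect it directly, which is the interpretability property the abstract contrasts with dense floating-point vectors.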
The surrogate loss of variational autoencoders (VAEs) poses various challenges to their training, inducing an imbalance between task fitting and representation inference. To avert this, the existing strategies for VAEs focus on adjusting the tradeoff by introducing hyperparameters, deriving a tighter bound under some mild assumptions, or decomposing the loss components per certain neural settings. VAEs still suffer from uncertain tradeoff learning. We propose a novel evolutionary variational autoencoder (eVAE) building on the variational information bottleneck (VIB) theory and integrative evolutionary neural learning. eVAE integrates a variational genetic algorithm into VAE with variational evolutionary operators including variational mutation, crossover, and evolution. Its inner-outer-joint training mechanism synergistically and dynamically generates and updates the uncertain tradeoff learning in the evidence lower bound (ELBO) without additional constraints. Apart from learning a lossy compression and representation of data under the VIB assumption, eVAE presents an evolutionary paradigm to tune critical factors of VAEs and deep neural networks and addresses the premature convergence and random search problem by integrating evolutionary optimization into deep learning. Experiments show that eVAE addresses the KL-vanishing problem for text generation with low reconstruction loss, generates all disentangled factors with sharp images, and improves the image generation quality, respectively. eVAE achieves better reconstruction loss, disentanglement, and generation-inference balance than its competitors.
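For readers unfamiliar with the tradeoff the abstract refers to, a commonly used weighted form of the ELBO is shown below. The weight $\beta$ is the kind of hand-set hyperparameter that existing strategies introduce; the notation ($q_\phi$ for the encoder, $p_\theta$ for the decoder, $p(z)$ for the prior) is assumed here for illustration, and the formula does not describe how eVAE itself updates the tradeoff.

```latex
\mathcal{L}_\beta(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - \beta \, \mathrm{KL}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

Setting $\beta = 1$ recovers the standard ELBO. The first term rewards reconstruction (task fitting) and the second regularizes the inferred representation, so the weight between them controls the balance that the abstract describes as uncertain.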